Embedding global barrier and collective in a torus network

ABSTRACT

Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes. In one embodiment, the method comprises taking inputs from a set of receivers of the nodes, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of senders of the nodes. Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. In one embodiment, the method comprises adding to a torus network a central collective logic to route messages among at least a group of nodes in a tree structure.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of co-pending U.S. patent application Ser. No. 12/723,277, filed Mar. 12, 2010, which claims the benefit of U.S. Provisional Patent Application Ser. No. 61/293,611, filed Jan. 8, 2010. The entire content and disclosure of U.S. patent application Ser. Nos. 12/723,277 and 61/293,611 are hereby incorporated herein by reference.

CROSS REFERENCE

The present invention is related to the following commonly-owned, co-pending United States patent applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Serial No. (YOR920090171US1 (24255)), for “USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY”; U.S. patent application Serial No. (YOR920090169US1 (24259)), for “HARDWARE SUPPORT FOR COLLECTING PERFORMANCE COUNTERS DIRECTLY TO MEMORY”; U.S. patent application Serial No. (YOR920090168US1 (24260)), for “HARDWARE ENABLED PERFORMANCE COUNTERS WITH SUPPORT FOR OPERATING SYSTEM CONTEXT SWITCHING”; U.S. patent application Serial No. (YOR920090473US1 (24595)), for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST RECONFIGURATION OF PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090474US1 (24596)), for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST MULTIPLEXING OF PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090533US1 (24682)), for “CONDITIONAL LOAD AND STORE IN A SHARED CACHE”; U.S. patent application Serial No. (YOR920090532US1 (24683)), for “DISTRIBUTED PERFORMANCE COUNTERS”; U.S. patent application Serial No. (YOR920090529US1 (24685)), for “LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS”; U.S. patent application Serial No. (YOR920090530US1 (24686)), for “PAUSE PROCESSOR HARDWARE THREAD UNTIL PIN TO PROCESSOR WAKE ON PIN”; U.S. patent application Serial No. (YOR920090526US1 (24687)), for “PRECAST THERMAL INTERFACE ADHESIVE FOR EASY AND REPEATED, SEPARATION AND REMATING”; U.S. patent application Serial No. (YOR920090527US1 (24688)), for “ZONE ROUTING IN A TORUS NETWORK”; U.S. patent application Serial No. (YOR920090531US1 (24689)), for “PROCESSOR WAKEUP UNIT TO PROCESSOR RESUME UNIT”; U.S. patent application Serial No. (YOR920090535US1 (24690)), for “TLB EXCLUSION RANGE”; U.S. patent application Serial No. (YOR920090536US1 (24691)), for “DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY”; U.S. patent application Serial No. (YOR920090538US1 (24692)), for “PARTIAL CACHE LINE SPECULATION SUPPORT”; U.S. patent application Serial No. (YOR920090539US1 (24693)), for “ORDERING OF GUARDED AND UNGUARDED STORES FOR NO-SYNC I/O”; U.S. patent application Serial No. (YOR920090540US1 (24694)), for “DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS”; U.S. patent application Serial No. (YOR920090541US1 (24695)), for “SUPPORT FOR NON-LOCKING PARALLEL RECEPTION OF PACKETS BELONGING TO THE SAME MESSAGE”; U.S. patent application Serial No. (YOR920090560US1 (24714)), for “OPCODE COUNTING FOR PERFORMANCE MEASUREMENT”; U.S. patent application Serial No. (YOR920090578US1 (24724)), for “MULTI-INPUT AND BINARY REPRODUCIBLE, HIGH BANDWIDTH FLOATING POINT ADDER IN A COLLECTIVE NETWORK”; U.S. patent application Serial No. (YOR920090579US1 (24731)), for “A MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER”; U.S. patent application Serial No. (YOR920090581US1 (24732)), for “CACHE DIRECTORY LOOK-UP REUSE”; U.S. patent application Serial No. (YOR920090582US1 (24733)), for “MEMORY SPECULATION IN A MULTI LEVEL CACHE SYSTEM”; U.S. patent application Serial No. (YOR920090583US1 (24738)), for “METHOD AND APPARATUS FOR CONTROLLING MEMORY SPECULATION BY LOWER LEVEL CACHE”; U.S. patent application Serial No. (YOR920090584US1 (24739)), for “MINIMAL FIRST LEVEL CACHE SUPPORT FOR MEMORY SPECULATION MANAGED BY LOWER LEVEL CACHE”; U.S. patent application Serial No. (YOR920090585US1 (24740)), for “PHYSICAL ADDRESS ALIASING TO SUPPORT MULTI-VERSIONING IN A SPECULATION-UNAWARE CACHE”; U.S. patent application Serial No. (YOR920090587US1 (24746)), for “LIST BASED PREFETCH”; U.S. patent application Serial No. (YOR920090590US1 (24747)), for “PROGRAMMABLE STREAM PREFETCH WITH RESOURCE OPTIMIZATION”; U.S. patent application Serial No. (YOR920090595US1 (24757)), for “FLASH MEMORY FOR CHECKPOINT STORAGE”; U.S. patent application Serial No. (YOR920090596US1 (24759)), for “NETWORK SUPPORT FOR SYSTEM INITIATED CHECKPOINTS”; U.S. patent application Serial No. (YOR920090597US1 (24760)), for “TWO DIFFERENT PREFETCH COMPLEMENTARY ENGINES OPERATING SIMULTANEOUSLY”; U.S. patent application Serial No. (YOR920090598US1 (24761)), for “DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK”; U.S. patent application Serial No. (YOR920090631US1 (24799)), for “IMPROVING RELIABILITY AND PERFORMANCE OF A SYSTEM-ON-A-CHIP BY PREDICTIVE WEAR-OUT BASED ACTIVATION OF FUNCTIONAL COMPONENTS”; U.S. patent application Serial No. (YOR920090632US1 (24800)), for “A SYSTEM AND METHOD FOR IMPROVING THE EFFICIENCY OF STATIC CORE TURN OFF IN SYSTEM ON CHIP (SoC) WITH VARIATION”; U.S. patent application Serial No. (YOR920090633US1 (24801)), for “IMPLEMENTING ASYNCHRONOUS COLLECTIVE OPERATIONS IN A MULTI-NODE PROCESSING SYSTEM”; U.S. patent application Serial No. (YOR920090586US1 (24861)), for “MULTIFUNCTIONING CACHE”; U.S. patent application Serial No. (YOR920090646US1 (24874)), for “ARBITRATION IN CROSSBAR FOR LOW LATENCY”; U.S. patent application Serial No. (YOR920090647US1 (24875)), for “EAGER PROTOCOL ON A CACHE PIPELINE DATAFLOW”; U.S. patent application Serial No. (YOR920090648US1 (24876)), for “EMBEDDED GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK”; U.S. patent application Serial No. (YOR920090649US1 (24877)), for “GLOBAL SYNCHRONIZATION OF PARALLEL PROCESSORS USING CLOCK PULSE WIDTH MODULATION”; U.S. patent application Serial No. (YOR920090650US1 (24878)), for “IMPLEMENTATION OF MSYNC”; U.S. patent application Serial No. (YOR920090651US1 (24879)), for “NON-STANDARD FLAVORS OF MSYNC”; U.S. patent application Serial No. (YOR920090652US1 (24881)), for “HEAP/STACK GUARD PAGES USING A WAKEUP UNIT”; U.S. patent application Serial No. (YOR920100002US1 (24882)), for “MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR”; and U.S. patent application Serial No. (YOR920100001US1 (24883)), for “REPRODUCIBILITY IN BGQ”.

GOVERNMENT CONTRACT

This invention was Government supported under Contract No. B554331 awarded by Department of Energy. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to parallel computing systems, and more specifically, to embedding global barrier and collective networks in a torus network.

2. Background Art

Massively parallel computing structures (also referred to as “ultra-scale” computers or “supercomputers”) interconnect large numbers of compute nodes, generally in the form of very regular structures, such as mesh, lattice, or torus configurations. The conventional approach for the most cost-effective ultrascalable computers has been to use processors configured in uni-processor or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Today, these supercomputing machines exhibit computing performance exceeding one petaflops.

One family of such massively parallel computers has been developed by the International Business Machines Corporation (IBM) under the name Blue Gene. Two members of this family are the Blue Gene/L system and the Blue Gene/P system. The Blue Gene/L system is a scalable system having over 65,000 compute nodes. Each node is comprised of a single application specific integrated circuit (ASIC) with two CPUs and memory. The full computer system is housed in sixty-four racks or cabinets with thirty-two node boards, or a thousand nodes, in each rack.

The Blue Gene/L computer system structure can be described as a compute node core with an I/O node surface, where communication to the compute nodes is handled by the I/O nodes. In the compute node core, the compute nodes are arranged into both a logical tree structure and a multi-dimensional torus network. The logical tree network connects the compute nodes in a tree structure so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to directly connect with its closest 6 neighbors in a section of the computer.

In massively parallel computing structures, multiple network paradigms are implemented to interconnect nodes for use individually or simultaneously and include three high-speed networks for parallel algorithm message passing. Additional networks are provided for external connectivity used for Input/Output, System Management and Configuration, and Debug and Monitoring services for the supercomputer nodes. The high-speed networks preferably include n-dimensional Torus, Global Tree, and Global Signal configurations. The use of each of these networks may switch back and forth based on algorithmic needs or phases of algorithms. For example, parts of calculations may be performed on the Torus, or part on the global Tree, which facilitates the development of new parallel algorithms that simultaneously employ multiple networks in novel ways.

With respect to the Global Tree network, one primary functionality is to support global broadcast (down-tree) and global reduce (up-tree) operations. Additional functionality is provided to support programmable point-to-point or sub-tree messaging used for input/output, program load, system management, parallel job monitoring and debug. This functionality enables “service” or input/output nodes to be isolated from the Torus so as not to interfere with parallel computation. That is, all nodes in the Torus may operate at the full computational rate, while service nodes off-load asynchronous external interactions. This ensures scalability and repeatability of the parallel computation since all nodes performing the computation operate at the full and consistent rate. Preferably, the global tree supports the execution of those mathematical functions implementing reduction messaging operations. Preferably, the Global Tree network additionally supports multiple independent virtual channels, allowing multiple independent global operations to proceed simultaneously. The design is configurable and the ratio of computation nodes to service nodes is flexible depending on requirements of the parallel calculations. Alternate packaging strategies allow any ratio, including a machine comprised of all service or input/output nodes, as would be ideal for extremely data-intensive computations.

A third network includes a Global Signal Network that supports communications of multiple asynchronous “signals” to provide global logical “AND” or “OR” functionality. This functionality is specifically provided to support global barrier operations (“AND”), for indicating to all nodes that, for example, all nodes in the partition have arrived at a specific point in the computation or phase of the parallel algorithm, and global notification (“OR”) functionality, for indicating to all nodes that, for example, one or any node in the partition has arrived at a particular state or condition. Use of this network type enables technology for novel parallel algorithms, coordination, and system management.
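As a minimal sketch of the two signal semantics described above, the following Python models a barrier as a logical AND over per-node arrival flags and a notification as a logical OR over per-node condition flags. The function names and the list-of-flags representation are illustrative assumptions and are not part of the described hardware.

```python
def global_barrier(arrived):
    """Logical AND: true only when every node in the partition has arrived."""
    return all(arrived)


def global_notification(raised):
    """Logical OR: true as soon as any node in the partition raises its flag."""
    return any(raised)


# Four nodes, one of which has not yet reached the barrier point.
flags = [True, True, False, True]
print(global_barrier(flags))       # False
print(global_notification(flags))  # True
```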

On previous generation BlueGene/L (BG/L) and BlueGene/P (BG/P) supercomputers, besides the high speed 3-dimensional torus network, there are also dedicated collective and global barrier networks. These have the advantage of independence among the different networks, but also have significant drawbacks: (1) extra high speed pins on chip, resulting in extra packaging cost, and (2) greater difficulty in designing applicable partitioning in packaging, because the 3 networks have different topologies.

BRIEF SUMMARY

Embodiments of the invention provide a method, system and computer program product for embedding a global barrier and global interrupt network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes, and each of the nodes has a plurality of receivers and a plurality of senders. In one embodiment, the method comprises taking inputs from a set of the receivers, dividing the inputs from the receivers into a plurality of classes, combining the inputs of each of the classes to obtain a result, and sending said result to a set of the senders. In an embodiment, the combining of the inputs is a logical OR operation.

In an embodiment, the sending includes sending said result to the set of senders to create a global barrier among a given set of said nodes. In one embodiment, the torus network is separated into a plurality of partitions, and the sending includes sending said result to the set of senders to create a global barrier among all of the nodes within one of said partitions.

In one embodiment, the multitude of nodes includes a plurality of classes of nodes, each of said classes including compute nodes and I/O nodes, and said result is an effective logical OR of all inputs from all the compute nodes and the I/O nodes within a given one of said classes. In one embodiment, the method further comprises, when one of the senders detects that a local barrier state of said one of the senders has changed to a new barrier state, said one of the senders sending said new barrier state to one of the receivers. In one embodiment, this sending of said new barrier state includes using a global barrier packet to send said new barrier state. In an embodiment, said global barrier packet identifies a packet type and a barrier state.

Embodiments of the invention provide a method, system and computer program product for embedding a collective network in a parallel computer system organized as a torus network. The computer system includes a multitude of nodes, and each of the nodes has a plurality of receivers and a plurality of senders. In one embodiment, the method comprises adding to the torus network a central collective logic to route messages among at least a group of said nodes in a tree structure, wherein, at defined times, one of said group of nodes is a root node and the others of said group of nodes are leaf or intermediate nodes. The method also comprises routing messages from the leaf or intermediate nodes to the root node in an up-tree direction; processing the messages being routed from the leaf or intermediate nodes to the root node to form a processed message; and sending the processed message back from the root node to at least one of the leaf or intermediate nodes.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 depicts a unit cell of a three-dimensional torus implemented in a massively parallel supercomputer.

FIG. 2 is a block diagram of a node of the supercomputer.

FIG. 3 is a block diagram showing a messaging unit and associated network logic that may be used in an embodiment of the invention.

FIG. 4 is a logic block diagram of one of the receivers shown in FIG. 3.

FIG. 5 is a logic block diagram of one of the senders shown in FIG. 3.

FIG. 6 shows the format of a collective data packet.

FIG. 7 illustrates the format of a point-to-point data packet.

FIG. 8 is a diagram of the central collective logic block of FIG. 3.

FIG. 9 depicts an arbitration process that may be used in an embodiment of the invention.

FIG. 10 illustrates a GLOBAL_BARRIER PACKET type in accordance with an embodiment of the invention.

FIG. 11 shows global collective logic that is used in one embodiment of the invention.

FIG. 12 illustrates global barrier logic that may be used in an embodiment of the invention.

FIG. 13 shows an example of a collective network embedded in a 2-D torus network.

DETAILED DESCRIPTION

As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.

Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium, upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The present invention relates to embedding global barrier and collective networks in a parallel computing system organized as a torus network. The invention may be implemented, in an embodiment, in a massively parallel computer architecture, referred to as a supercomputer. As a more specific example, the invention, in an embodiment, may be implemented in a massively parallel computer developed by the International Business Machines Corporation (IBM) under the name Blue Gene/Q. The Blue Gene/Q is expandable to 512 compute racks, each with 1024 compute node ASICs (BQC) including 16 PowerPC A2 processor cores at 1600 MHz. Each A2 core has associated a quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. Cabled as a single system, the multiple racks can be partitioned into smaller systems by programming switch chips, termed the BG/Q Link ASICs (BQL), which source and terminate the optical cables between midplanes.

Each compute rack is comprised of 2 sets of 512 compute nodes. Each set is packaged around a double-sided backplane, or midplane, which supports a five-dimensional torus of size 4×4×4×4×2. This torus is the communication network for the compute nodes, which are packaged on 16 node boards. The torus network can be extended in 4 dimensions through link chips on the node boards, which redrive the signals optically, with an architecture limit of 64 in any torus dimension. The signaling rate is 10 Gb/s (8/10 encoded) over ˜20 meter multi-mode optical cables at 850 nm. As an example, a 96-rack system is connected as a 16×16×16×12×2 torus, with the last ×2 dimension contained wholly on the midplane. For reliability reasons, small torus dimensions of 8 or less may be run as a mesh rather than a torus with minor impact to the aggregate messaging rate.
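The rack and torus sizes quoted above can be cross-checked with a few lines of arithmetic. The following Python is a minimal sketch; the constant and function names are illustrative assumptions, while the dimensions themselves come from the text.

```python
from math import prod

MIDPLANE_DIMS = (4, 4, 4, 4, 2)     # five-dimensional torus on one midplane
MIDPLANES_PER_RACK = 2              # each rack holds 2 sets of 512 nodes

nodes_per_midplane = prod(MIDPLANE_DIMS)
nodes_per_rack = MIDPLANES_PER_RACK * nodes_per_midplane


def racks_for_torus(dims):
    """Number of racks needed to hold a torus with the given dimensions."""
    return prod(dims) // nodes_per_rack


print(nodes_per_midplane)                       # 512
print(nodes_per_rack)                           # 1024
print(racks_for_torus((16, 16, 16, 12, 2)))     # 96
```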

The Blue Gene/Q platform contains four kinds of nodes: compute nodes (CN), I/O nodes (ION), login nodes (LN), and service nodes (SN). The CN and ION share the same compute ASIC.

In addition, associated with a prescribed plurality of processing nodes is a dedicated node that comprises a quad-processor with external memory, for handling of I/O communications to and from the compute nodes. Each I/O node has an operating system that can handle basic tasks and all the functions necessary for high performance real time code. The I/O nodes contain a software layer above the layer on the compute nodes for handling host communications. The choice of host will depend on the class of applications and their bandwidth and performance requirements.

In an embodiment, each compute node of the massively parallel computer architecture is connected to six neighboring nodes via six bi-directional torus links, as depicted in the three-dimensional torus sub-cube portion shown at 10 in FIG. 1. It is understood, however, that other architectures comprising more or fewer processing nodes in different torus configurations (i.e., different numbers of racks) may also be used.

The ASIC that powers the nodes is based on system-on-a-chip (s-o-c) technology and incorporates all of the functionality needed by the system. The nodes themselves are physically small, allowing for a very high density of processing and optimizing cost/performance.

Referring now to FIG. 2, there is shown the overall architecture of the multiprocessor computing node 50 implemented in a parallel computing system in which the present invention is implemented. In one embodiment, the multiprocessor system implements the proven Blue Gene® architecture, and is implemented in a BlueGene/Q massively parallel computing system comprising, for example, 1024 compute node ASICs (BQC), each including multiple processor cores.

A compute node of the present massively parallel supercomputer architecture, in which the present invention may be employed, is illustrated in FIG. 2. The compute node 50 is a single chip (“nodechip”) based on low power A2 PowerPC cores, though the architecture can use any processor cores, and may comprise one or more semiconductor chips. In the embodiment depicted, the node includes 16 PowerPC A2 cores at 1600 MHz.

More particularly, the basic nodechip 50 of the massively parallel supercomputer architecture illustrated in FIG. 2 includes (sixteen or seventeen) 16+1 symmetric multiprocessing (SMP) cores 52, each core being 4-way hardware threaded, supporting transactional memory and thread level speculation, and including a Quad Floating Point Unit (FPU) 53 on each core (204.8 GF peak node). In one implementation, the core operating frequency target is 1.6 GHz providing, for example, a 563 GB/s aggregated memory bandwidth to shared L2 cache 70 via a full crossbar switch 60. In one embodiment, there is provided 32 MB of shared L2 cache 70, each core having associated 2 MB of L2 cache 72. There is further provided external DDR SDRAM (e.g., Double Data Rate synchronous dynamic random access) memory 80, as a lower level in the memory hierarchy in communication with the L2. In one embodiment, the node includes 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3) (2 channels each with chip kill protection).
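The 204.8 GF peak figure follows from numbers given in this description: 16 compute cores, 8 double precision operations per cycle per core, and a 1.6 GHz clock. A minimal Python check, with variable names chosen here purely for illustration:

```python
CORES = 16                     # compute cores per node (the +1 core is not counted)
FLOPS_PER_CYCLE_PER_CORE = 8   # quad-wide fused multiply-add: 4 lanes x 2 operations
FREQ_GHZ = 1.6                 # core operating frequency target

flops_per_cycle_per_chip = CORES * FLOPS_PER_CYCLE_PER_CORE
peak_gflops = flops_per_cycle_per_chip * FREQ_GHZ

print(flops_per_cycle_per_chip)   # 128
print(round(peak_gflops, 1))      # 204.8
```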

Each FPU 53 associated with a core 52 has a 32B wide data path to the L1-cache 55 of the A2, allowing it to load or store 32B per cycle from or into the L1-cache 55. Each core 52 is directly connected to a private prefetch unit (level-1 prefetch, L1P) 58, which accepts, decodes and dispatches all requests sent out by the A2. The load interface from the A2 core 52 to the L1P 58 is 32B wide and the store interface is 16B wide, both operating at processor frequency. The L1P 58 implements a fully associative, 32 entry prefetch buffer. Each entry can hold an L2 line of 128B size. The L1P 58 provides two prefetching schemes: a sequential prefetcher as used in previous BlueGene architecture generations, as well as a list prefetcher.

As shown in FIG. 2, the 32 MiB shared L2 is sliced into 16 units, each connecting to a slave port of the switch 60. Every physical address is mapped to one slice using a selection of programmable address bits or an XOR-based hash across all address bits. The L2-cache slices, the L1Ps and the L1-D caches of the A2s are hardware-coherent. A group of 8 slices is connected via a ring to one of the two DDR3 SDRAM controllers 78.
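The two address-to-slice mappings mentioned above can be illustrated with a short Python sketch. The text does not specify which address bits are selected or the exact hash function, so the bit positions and the XOR-folding scheme below are hypothetical choices made only to show the idea; the slice count (16) and the 128-byte line size are taken from this description.

```python
NUM_SLICES = 16     # the shared L2 is sliced into 16 units
LINE_BYTES = 128    # an L2 line is 128 bytes


def slice_by_bits(addr, bit_positions):
    """Build a slice index from selected address bits (positions are hypothetical)."""
    idx = 0
    for i, pos in enumerate(bit_positions):
        idx |= ((addr >> pos) & 1) << i
    return idx


def slice_by_xor_fold(addr):
    """Illustrative hash: XOR-fold all line-address bits into a 4-bit slice index."""
    line = addr // LINE_BYTES
    idx = 0
    while line:
        idx ^= line & (NUM_SLICES - 1)
        line >>= 4
    return idx


print(slice_by_bits(128, (7, 8, 9, 10)))   # 1
print(slice_by_xor_fold(128))              # 1
print(slice_by_xor_fold(128 * 0x11))       # 0
```

In both variants every byte of a given 128-byte line maps to the same slice, because the low 7 address bits do not participate in the index.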

An embodiment of the invention implements a direct memory access engine referred to herein as a Messaging Unit (“MU”), such as MU 100, with each MU including 3 XBAR master interfaces, 1 XBAR slave interface, a number of DMA engines for processing packets, and interfaces to the Network logic unit. In one embodiment, the compute node further includes, in a non-limiting example: 10 intra-rack and inter-rack interprocessor links 90, each at 2.0 GB/s (i.e., 10*2 GB/s, configurable as a 5-D torus in one embodiment); and one I/O link 92 interfaced with the MU at 2.0 GB/s (to the I/O subsystem). The system node employs or is associated and interfaced with 8-16 GB of memory per node.

Although not shown, each A2 core has associated a quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. A2 is a 4-way multi-threaded 64b PowerPC implementation. Each A2 core has its own execution unit (XU), instruction unit (IU), and quad floating point unit (QPU) connected via the AXU (Auxiliary eXecution Unit) (FIG. 2). The QPU is an implementation of the 4-way SIMD QPX floating point instruction set architecture. QPX is an extension of the scalar PowerPC floating point architecture. It defines 32 32B-wide floating point registers per thread instead of the traditional 32 scalar 8B-wide floating point registers.

The BG/Q network is a 5-dimensional (5-D) torus for the compute nodes. In a compute chip, besides the 10 bidirectional links to support the 5-D torus, there is also a dedicated I/O link running at the same speed as the 10 torus links that can be connected to an I/O node.
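The count of 10 torus links per node follows from two directions (plus and minus) in each of 5 dimensions. The Python below is a minimal model of torus neighbor addressing with wraparound; the function name and the coordinate-tuple representation are assumptions made for illustration.

```python
def torus_neighbors(coord, dims):
    """Return the coordinates reached by a -1 and a +1 step along each axis, with wraparound."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


dims = (4, 4, 4, 4, 2)
links = torus_neighbors((0, 0, 0, 0, 0), dims)
print(len(links))        # 10
print(len(set(links)))   # 9
```

In this model the 10 link directions reach only 9 distinct coordinates on a 4×4×4×4×2 torus, because both steps along the size-2 dimension wrap to the same neighbor.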

The BG/Q torus network originally supports 3 kinds of packets: (1) point-to-point DATA packets from 32 bytes to 544 bytes, including a 32 byte header and a 0 to 512 byte payload in multiples of 32 bytes, as shown in FIG. 7; (2) 12 byte TOKEN_ACK (token and acknowledgement) packets, not shown; (3) 12 byte ACK_ONLY (acknowledgement only) packets, not shown.
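The DATA packet size rule above (a 32 byte header plus a payload of 0 to 512 bytes in 32 byte steps) can be expressed as a small validation function. This is a sketch only; the function and constant names are illustrative assumptions.

```python
HEADER_BYTES = 32
CHUNK_BYTES = 32
MAX_PAYLOAD_BYTES = 512


def data_packet_size(payload_bytes):
    """Total DATA packet size for a payload that is a valid multiple of 32 bytes."""
    if payload_bytes % CHUNK_BYTES or not 0 <= payload_bytes <= MAX_PAYLOAD_BYTES:
        raise ValueError("payload must be 0..512 bytes in multiples of 32")
    return HEADER_BYTES + payload_bytes


sizes = [data_packet_size(p) for p in range(0, MAX_PAYLOAD_BYTES + 1, CHUNK_BYTES)]
print(len(sizes), min(sizes), max(sizes))   # 17 32 544
```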

FIG. 3 shows the messaging unit and the network logic block diagrams that may be used on a compute node in one embodiment of the invention. The torus network is comprised of (1) injection fifos 302, (2) reception fifos 304, (3) receivers 306, and (4) senders 308. The injection fifos include: 10 normal fifos, 2 KB buffer space each; 2 loopback fifos, 2 KB each; 1 high priority and 1 system fifo, 4 KB each. The reception fifos include: 10 normal fifos tied to individual receivers, 2 KB each; 2 loopback fifos, 2 KB each; 1 high priority and 1 system fifo, 4 KB each. Also, in one embodiment, the torus network includes eleven receivers 306 and eleven senders 308.

The receiver logic diagram is shown in FIG. 4. Each receiver has four virtual channels (VC) with 4 KB of buffers: one dynamic VC 402, one deterministic VC 404, one high priority VC 406, and one system VC 408.

The sender logic block diagram is shown in FIG. 5. Each sender has an 8 KB retransmission fifo 502. The DATA and TOKEN_ACK packets carry a link level sequence number and are stored in the retransmission fifo. Both of these packet types are acknowledged via either TOKEN_ACK or ACK_ONLY packets on the reverse link when they are successfully transmitted over electrical or optical cables. If there is a link error, then the acknowledgement will not be received and a timeout mechanism will lead to re-transmissions of these packets until they are successfully received by the receiver on the other end. The ACK_ONLY packets do not carry a sequence number and are sent over each link periodically.
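The link-level behavior described above (sequence-numbered packets held in a retransmission fifo until acknowledged, and resent on timeout) can be sketched in Python as follows. This is a simplified software model under stated assumptions: acknowledgements are treated as cumulative, sequence numbers never wrap, and the 8 KB capacity limit is not modeled. None of those details are specified in the text, and the class and method names are hypothetical.

```python
from collections import deque


class Sender:
    """Toy model of a sender with a retransmission fifo."""

    def __init__(self):
        self.next_seq = 0
        self.retransmit_fifo = deque()   # (seq, packet) pairs awaiting acknowledgement

    def send(self, packet):
        """Assign a sequence number and keep a copy until it is acknowledged."""
        seq = self.next_seq
        self.next_seq += 1
        self.retransmit_fifo.append((seq, packet))
        return seq, packet

    def on_ack(self, acked_seq):
        """Assumed cumulative ack: drop every stored packet up to and including acked_seq."""
        while self.retransmit_fifo and self.retransmit_fifo[0][0] <= acked_seq:
            self.retransmit_fifo.popleft()

    def on_timeout(self):
        """Return all unacknowledged packets, oldest first, for retransmission."""
        return list(self.retransmit_fifo)


sender = Sender()
for payload in ("a", "b", "c"):
    sender.send(payload)
sender.on_ack(0)
print(sender.on_timeout())   # [(1, 'b'), (2, 'c')]
```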

To embed a collective network over the 5-D torus, a new collective DATA packet type is supported by the network logic. The collective DATA packet format shown in FIG. 6 is similar in structure to the point-to-point DATA packet format shown in FIG. 7. The packet type x“55” in byte 0 of the point-to-point DATA packet format is replaced by a new collective DATA packet type x“5A”. The point-to-point routing bits in bytes 1, 2 and 3 are replaced by collective operation code, collective word length and collective class route, respectively. The collective operation code field indicates one of the supported collective operations, such as binary AND, OR, XOR, unsigned integer ADD, MIN, MAX, signed integer ADD, MIN, MAX, as well as floating point ADD, MIN and MAX.
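The first four header bytes of a collective DATA packet, as described above, can be sketched as a simple byte layout in Python. The packet type values 0x55 and 0x5A come from the text. The numeric operation codes below are hypothetical, since the text names the operations but not their encodings, and byte 3 is treated as an opaque value rather than being decoded.

```python
POINT_TO_POINT_TYPE = 0x55   # byte 0 of a point-to-point DATA packet
COLLECTIVE_TYPE = 0x5A       # byte 0 of a collective DATA packet

# Hypothetical numbering for a few of the operations named in the text.
OP_CODES = {"AND": 0, "OR": 1, "XOR": 2, "UNSIGNED_ADD": 3}


def collective_header_prefix(op, word_length, byte3):
    """Build bytes 0..3: packet type, operation code, word length, class route byte."""
    for field in (word_length, byte3):
        if not 0 <= field <= 0xFF:
            raise ValueError("field does not fit in one byte")
    return bytes([COLLECTIVE_TYPE, OP_CODES[op], word_length, byte3])


def is_collective(header):
    """Distinguish collective from point-to-point DATA packets by byte 0."""
    return header[0] == COLLECTIVE_TYPE


header = collective_header_prefix("OR", word_length=1, byte3=3)
print(header.hex())            # 5a010103
print(is_collective(header))   # True
```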

The collective word length indicates the operand size in units of 2^n*4 bytes for signed and unsigned integer operations, while the floating point operand size is fixed at 8 bytes (64 bit double precision floating point numbers). The collective class route identifies one of 16 class routes that are supported on the BG/Q machine. On a single node, the 16 classes are defined in Device Control Ring (DCR) control registers. Each class has 12 input bits identifying input ports, for the 11 receivers as well as the local input; and 12 output bits identifying output ports, for the 11 senders as well as the local output. In addition, each class definition also has 2 bits indicating whether the particular class is used as user Comm_World (e.g., all compute nodes in this class), user sub-communicators (e.g., a subset of compute nodes), or system Comm_World (e.g., all compute nodes, possibly with I/O nodes serving the compute partition also).

The algorithm for setting up deadlock-free collective classes is described in co-pending patent application YOR920090598US1. An example of a collective network embedded in a 2-D torus network is shown in FIG. 13. Inputs from all nodes are combined along the up-tree path and end up on the root node. The result is then turned around at the root node and broadcast down the virtual tree back to all contributing nodes.

In byte 3 of the collective DATA packet header, bits 3 and 4 define a collective operation type, which can be (1) broadcast, (2) all-reduce or (3) reduce. Broadcast means one node broadcasts a message to all the nodes; there is no combining of data. In an all-reduce operation, each contributing node in a class contributes a message of the same length, the input message data in the data packet payload from all contributing nodes are combined according to the collective OP code, and the combined result is broadcast back to all contributing nodes. The reduce operation is similar to all-reduce, but in a reduce operation the combined result is received only by the target node; all other nodes will discard the broadcast they receive.

In the Blue Gene/Q compute chip (BQC) network logic, two additional collective injection fifos (one user + one system) and two collective reception fifos (one user + one system) are added for the collective network, as shown in FIG. 3 at 302 and 304. A central collective logic block 306 is also added. In each of the receivers, two collective virtual channels are added, as shown in FIG. 4 at 412 and 414. Each receiver also has an extra collective data bus 310 output to the central collective logic, as well as collective requests and grants (not shown) for arbitration. In the sender logic, illustrated in FIG. 5, the number of input data buses to the data mux 504 is expanded by one extra data bus coming from the central collective logic block 306. The central collective logic will select either the up tree or the down tree data path for each sender depending on the collective class map of the data packet. Additional request and grant signals from the central collective logic block 306 to each sender are not shown.

A diagram of the central collective logic block 306 is shown in FIG. 8. In an embodiment, there are two separate data paths 802 and 804. Path 802 is for uptree combine, and path 804 is for downtree broadcast. This allows full bandwidth collective operations without uptree and downtree interference. The sender arbitration logic is, in an embodiment, modified to support the collective requests. The uptree combining operation for floating point numbers is further illustrated in co-pending patent application YOR920090578US1.

When the torus network is routing point-to-point packets, priority is given to system packets. For example, when both user and system requests (either from receivers or from injection fifos) are presented to a sender, the network will give the grant to one of the system requests. However, when the collective network is embedded into the torus network, there is a possibility of livelock because, at each node, both system and user collective operations share the up-tree and down-tree logic paths, and each collective operation involves more than one node. For example, a continued stream of system packets going over a sender could block a down-tree user collective on the same node from progressing. This down-tree user collective class may include other nodes that happen to belong to another system collective class. Because the user down-tree collective already occupies the down-tree collective logic on those other nodes, the system collective on the same nodes then cannot make progress. To avoid the potential livelock between the collective network traffic and the regular torus network traffic, the arbitration logic in both the central collective logic and the senders is modified.

In the central collective arbiter, shown in FIG. 9, the following arbitration priorities are implemented:

(1) down tree system collective, highest priority,

(2) down tree user collective, second priority,

(3) up tree system collective, third priority,

(4) up tree user collective, lowest priority.

In addition, the down-tree arbitration logic in the central collective block also implements a DCR programmable timeout, where if the request to a given sender does not make progress for a certain time, all requests to different senders and/or the local reception fifo involved in the broadcast are cancelled and a new request/grant arbitration cycle will follow.
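The fixed priority order of the central collective arbiter can be sketched as follows. This is a minimal illustration only: the request names and the `grant` function are hypothetical and do not correspond to actual hardware signals, and the DCR programmable timeout is not modelled.

```python
# Hypothetical request names, listed from highest to lowest priority
# as given in items (1) through (4) above.
PRIORITY = [
    "down_tree_system",
    "down_tree_user",
    "up_tree_system",
    "up_tree_user",
]


def grant(requests):
    """Return the highest-priority pending request, or None if none is pending."""
    for kind in PRIORITY:
        if kind in requests:
            return kind
    return None
```

For example, with an up tree user request and a down tree user request both pending, this sketch grants the down tree user request.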

In the network sender, the arbitration logic priority is further modified as follows, in order of descending priority:

(1) round-robin between regular torus point-to-point system and collective; when collective is selected, priority is given to down tree requests;

(2) regular torus point-to-point high priority VC;

(3) regular torus point-to-point normal VCs (dynamic and deterministic).
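A minimal sketch of the sender priority order listed above, under stated assumptions: the class name, the request names, and the rule that the round-robin pointer only advances when both system point-to-point and collective requests are pending are all illustrative choices, not details taken from the hardware.

```python
class SenderArbiter:
    """Toy model of the modified sender arbitration (names are hypothetical)."""

    def __init__(self):
        # Whose turn it is when system point-to-point and collective both request.
        self.collective_turn = False

    def grant(self, requests):
        has_system = "p2p_system" in requests
        # Within collective traffic, down tree requests go before up tree requests.
        collective = next(
            (r for r in ("collective_down", "collective_up") if r in requests),
            None,
        )
        if has_system and collective:
            choice = collective if self.collective_turn else "p2p_system"
            self.collective_turn = not self.collective_turn
            return choice
        if has_system:
            return "p2p_system"
        if collective:
            return collective
        # Lower levels: high priority VC, then normal VCs.
        for r in ("p2p_high_priority", "p2p_normal"):
            if r in requests:
                return r
        return None
```

The alternation at the top level is what keeps a continuous stream of system packets from starving collective traffic on the same sender.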

On BlueGene/L and BlueGene/P, the global barrier network is a separate and independent network. The same network can be used for (1) global AND (global barrier) operations, or (2) global OR (global notification or global interrupt) operations. For each programmable global barrier bit on each local node, a global wired logical “OR” of all input bits from all nodes in a partition is implemented in hardware. The global AND operation is achieved by first “arming” the wire, in which case each node programs its own bit to ‘1’. After each node participating in the global AND (global barrier) operation has finished “arming” its bit, a node then lowers its bit to ‘0’ when the global barrier function is called. The global barrier bit will stay at ‘1’ until all nodes have lowered their bits, therefore achieving a logical global AND operation. After a global barrier, the bit then needs to be re-armed. On the other hand, to do a global OR (for a global notification or global interrupt operation), each node would initially lower its bit, then any one node could raise a global attention by programming its own bit to ‘1’.
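The arm-then-lower scheme described above can be illustrated with a small software model. This is only a sketch: the function names are hypothetical, and the wired OR is modelled as a plain Python reduction rather than a hardware signal.

```python
def wired_or(bits):
    """Model of the global wired OR over one barrier bit from every node."""
    return int(any(bits))


def simulate_global_and(arrival_order, num_nodes):
    """Arm every node's bit, then lower bits in arrival order.

    Returns the value seen on the global wire after each node lowers its bit.
    """
    bits = [1] * num_nodes  # all nodes have armed their bit
    history = []
    for node in arrival_order:
        bits[node] = 0  # this node calls the barrier
        history.append(wired_or(bits))
    return history
```

The wire only drops to 0 after the last node has lowered its bit, which is the logical AND of "every node has arrived".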

To embed the global barrier and global interrupt network over the existing torus network, in one embodiment, a new GLOBAL_BARRIER packet type is used. This packet type, an example of which is shown in FIG. 10 at 1000, is also 12 bytes, including: 1 byte type, 3 byte barrier state, 1 byte acknowledged sequence number, 1 byte packet sequence number, 6 byte Reed-Solomon checking code. This packet is similar to the TOKEN_ACK packet and is also stored in the retransmission fifo and covered by an additional link-level CRC.

The logic addition includes each receiver's packet decoder (shown at 416 in FIG. 4), which decodes the GLOBAL_BARRIER packets and sends the barrier state to the central global barrier logic, shown in FIG. 11. The central collective logic 1100 takes each receiver's 24 input bits, as well as the memory mapped local node contribution, and then splits all inputs into 16 classes, with 3 bits per contributor per class. The class map definitions are similar to those in the collectives, i.e., each class has 12 input enable bits and 12 output enable bits. When all 12 output enable bits are zero, this indicates the current node is the root of the class, and the input enable bits are used as the output enable bits. Each of the 3 bits of the class from each of the 12 inputs is ANDed with the corresponding input enable, and the result bits are ORed together into a single 3 bit state for this particular class. The resulting 3 bits of the current class are then replicated 12 times, 3 bits for each output link. Each output link's 3 bits are then ANDed with the output enable bit, and the resulting 3 bits are then given to the corresponding sender or to the local barrier state.

Each class map (collective or global barrier) has 12 input bits and 12 output bits. When a bit is high or set to ‘1’, the corresponding port is enabled. A typical class map will have multiple input bits set, but only one output bit set, indicating the up tree link. On the root node of a class, all output bits are set to zero, and the logic recognizes this and uses the input bits for outputs. Both collective and global barrier have separate up-tree logic and down-tree logic. When a class map is defined, except for the root node, all nodes will combine all enabled inputs and send to the one output port in an up-tree combine, then take the one up-tree port (defined by the output class bits) as the input of the down-tree broadcast, and broadcast the results to all other senders/local reception defined by the input class bits, i.e., the class map is defined for up-tree operation, and in the down-tree logic, the actual input and output ports (receivers and senders) are reversed. At the root of the tree, all output class bits are set to zero, and the logic combines data (packet data for collective, global barrier state for global barrier) from all enabled input ports (receivers), reduces the combined data to a single result, and then broadcasts the result back to all the enabled outputs (senders) using the same input class bits, i.e., the result is turned around and broadcast back to all the input links.

FIG. 12 shows the detailed implementation of the up-tree and down-tree global barrier combining logic inside block 1100 (FIG. 11). The drawing is shown for one global barrier class c and one global barrier state bit j=3*c+k, where k=0, 1, 2. This logic is then replicated multiple times for each class c, and for every input bit k. In the up-tree path, each input bit (from receivers and local input global barrier control registers) is ANDed with the up-tree input class enable for the corresponding input, and the resulting bits are then OR reduced (1220, via a tree of OR gates or logically equivalent gates) into a single bit. This bit is then fanned out and ANDed with the up-tree output class enables to form up_tree_output_state(i, j), where i is the output port number. Similarly, each input bit is also fanned out into the down-tree logic, but with the input and output class enables switched, i.e., down-tree input bits are enabled by up-tree output class map enables, and down-tree output bits down_tree_output_state(i,j) are enabled by up-tree input class map enables. On a normal node, a number of up-tree input enable bits are set to ‘1’, while only one up-tree output class bit is set to ‘1’. On the root node of the global barrier tree, all output class map bits are set to ‘0’, and the up-tree state bit is then fed back directly to the down tree OR reduce logic 1240. Finally, the up-tree and down-tree state bits are ORed together for each sender and the local global barrier status:

Sender(i) global barrier state(j) = up_tree_output_state(i,j) OR down_tree_output_state(i,j);

LOCAL global barrier status(j) = up_tree_output_state(i=last,j) OR down_tree_output_state(i=last,j).
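The combining logic for a single state bit j of one class can be sketched in software as follows. This is a simplified model under stated assumptions: the function name `barrier_bit` is hypothetical, ports are indexed 0 to 11 with the local port taken as the last index, and the hardware OR trees 1220 and 1240 are modelled with Python's `any`.

```python
NUM_PORTS = 12  # 11 receivers/senders plus the local port (assumed to be index 11)


def barrier_bit(inputs, in_enable, out_enable):
    """Return the per-port output state for one barrier state bit.

    inputs, in_enable and out_enable are lists of 0/1 values of length NUM_PORTS;
    in_enable and out_enable are the up-tree class map enables.
    """
    is_root = not any(out_enable)
    # Up-tree: OR of all inputs enabled by the up-tree input class map.
    up = any(b and e for b, e in zip(inputs, in_enable))
    # Down-tree: inputs are enabled by the up-tree OUTPUT class map.
    down = any(b and e for b, e in zip(inputs, out_enable))
    if is_root:
        # On the root, the up-tree result is fed back into the down-tree reduce.
        down = down or up
    up_out = [up and e for e in out_enable]
    # Down-tree outputs are enabled by the up-tree INPUT class map.
    down_out = [down and e for e in in_enable]
    return [int(bool(u or d)) for u, d in zip(up_out, down_out)]
```

On a non-root node the up-tree result appears only on the single enabled up-tree port, while a value arriving on that port is fanned back out to every enabled input port; on the root the combined value is turned around directly.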

On BlueGene/L and BlueGene/P, each global barrier is implemented by a single wire per node, and the effective global barrier logic is a global OR of all input signals from all nodes. Because there is a physical limit to the size of the largest machine, there is an upper bound for the signal propagation time, i.e., the round trip latency of a barrier from the furthest node going up-tree to the root that received the down-tree signal at the end of a barrier tree is limited, typically within about one microsecond. Thus a simple timer tick is implemented for each barrier, and one will not enter the next barrier until a preprogrammed time has passed. This allows each signal wire on a node to be used as an independent barrier. However, on BlueGene/Q, when the global barrier is embedded in the torus network, because of the possibility of link errors on the high speed links, and the associated retransmission of packets in the presence of link errors, it is, in an embodiment, impossible to come up with a reliable timeout without making the barrier latency unnecessarily long. Therefore, one has to use multiple bits for a single barrier. In fact, each global barrier requires 3 status bits; the 3 byte (24 bit) barrier state in Blue Gene/Q therefore supports 8 barriers per physical link.

To initialize a barrier of a global barrier class, each node will first program its 3 bit barrier control register to “100”, and it then waits for its own barrier state to become “100”, after which a different global barrier is called to ensure all contributing nodes in this barrier class have reached the same initialized state. This global barrier can be either a control system software barrier when the first global barrier is being set up, or an existing global barrier in a different class that has already been initialized. Once the barrier of a class is set up, the software can then go through the following steps without any other barrier classes being involved. (1) From “100”, the local global barrier control for this class is set to “010”, and when the first bit of the 3 status bits reaches ‘0’, the global barrier for this class is achieved. Because of the nature of the global OR operations, the 2nd bit of the global barrier status will reach ‘1’ either before or at the same time as the first bit goes to ‘0’, i.e., when the 1st bit is ‘0’, the global barrier status bits will be “010”, but they might have gone through an intermediate “110” state first. (2) For the second barrier, the global barrier control for this class is set from “010” to “001”, i.e., lower the second bit and raise the 3rd bit, and wait for the 2nd bit of status to change from ‘1’ to ‘0’. (3) Similarly, the third barrier is done by setting the control state from “001” to “100”, and then waiting for the third bit to go low. After the 3rd barrier, the whole sequence repeats.
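The three-step sequence above can be traced with a small software model. This is a sketch under stated assumptions: each node's 3 bit control value is modelled as a Python integer whose most significant bit stands for the "first bit", the global status is modelled as the bitwise OR of all control values, and the names `global_status`, `SEQUENCE` and `run_barrier` are hypothetical.

```python
# Control values cycled through by successive barriers: "100", "010", "001".
SEQUENCE = [0b100, 0b010, 0b001]


def global_status(controls):
    """Bitwise OR of every node's 3 bit control value."""
    status = 0
    for c in controls:
        status |= c
    return status


def run_barrier(controls, phase, arrival_order):
    """Move each node, in arrival order, to the next control value.

    Returns the global status observed after each node's update. The barrier
    for this phase is complete once the bit of the old control value reads 0.
    """
    new = SEQUENCE[(phase + 1) % 3]
    seen = []
    for node in arrival_order:
        controls[node] = new
        seen.append(global_status(controls))
    return seen
```

With three nodes all initialized to “100”, the first barrier passes through the intermediate “110” status until the last node arrives, at which point the status becomes “010” and the first bit reads ‘0’.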

An embedded global barrier requires 3 bits, but if configured as a global interrupt (global notification), each of the 3 bits can be used separately, although every 3 notification bits share the same class map.

While the BG/Q network design supports all 5 dimensions labeled A, B, C, D, E symmetrically, in practice, the fifth E dimension, in one embodiment, is kept at 2 for BG/Q. This allows the doubling of the number of barriers by keeping one group of 8 barriers in the E=0 4-D torus plane, and the other group of 8 barriers in the E=1 plane. The barrier network processor memory interface therefore supports 16 barriers. Each node can set a 48 bit global barrier control register, and read another 48 bit barrier state register. There is a total of 16 class maps that can be programmed, one for each of 16 barriers. Each receiver carries a 24 bit barrier state, and so does each sender. The central barrier logic takes all receiver inputs plus the local contribution, divides them into 16 classes, then combines them into an OR of all inputs in each class, and the result is then sent to the torus senders. Whenever a sender detects that its local barrier state has changed, the sender sends the new barrier state to the next receiver using the GLOBAL_BARRIER packet. This results in an effective OR of all inputs from all compute and I/O nodes within a given class map. Global barrier class maps can also go over the I/O link to create a global barrier among all compute nodes within a partition.
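The arithmetic behind the 48 bit registers (16 barriers of 3 bits each) can be illustrated by field accessors such as the following. This is a sketch only: the helper names are hypothetical, and numbering bit j=3*c+k from the least significant bit is an assumption made for illustration; the actual register bit ordering is not specified here.

```python
BITS_PER_BARRIER = 3
NUM_BARRIERS = 16  # 16 * 3 = 48 bits in the control and state registers
FIELD_MASK = (1 << BITS_PER_BARRIER) - 1


def barrier_field(register, c):
    """Extract the 3 bit field of barrier class c from a 48 bit register value."""
    return (register >> (BITS_PER_BARRIER * c)) & FIELD_MASK


def set_barrier_field(register, c, value):
    """Return the register value with the 3 bit field of class c replaced."""
    shift = BITS_PER_BARRIER * c
    cleared = register & ~(FIELD_MASK << shift)
    return cleared | ((value & FIELD_MASK) << shift)
```

For example, writing “100” into the field of class 5 leaves every other class's field untouched.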

The above feature of doubling the class map is also used by the embedded collective logic. Normally, to support three collective types, i.e., user Comm_World, user sub_comm, and system, three virtual channels would be needed in each receiver. However, because the fifth dimension is a by 2 dimension on BG/Q, user COMM_WORLD can be mapped to one 4-D plane (e=0) and the system can be mapped to another 4-D plane (e=1). Because there are no physical links being shared, the user COMM_WORLD and system can share a virtual channel in the receiver, shown in FIG. 7 as collective VC 0, reducing the buffers being used.

In one embodiment of the invention, because the size of the 5th dimension is 2, the class map is doubled from 8 to 16. For global barriers, class 0 and class 8 will use the same receiver input bits, but different groups of the local inputs (the 48 bit local input is divided into 2 groups of 24 bits). Class i (0 to 7) and class i+8 (8 to 15) cannot share any physical links; these class configuration control bits are under system control. With this doubling, each logic block in FIG. 12 is additionally replicated one more time, with the sender output in FIG. 12 further modified:

Sender(i) global barrier state(j) = up_tree_output_state_group0(i,j) OR down_tree_output_state_group0(i,j) OR up_tree_output_state_group1(i,j) OR down_tree_output_state_group1(i,j);

The local state has separate wires for each group (48 bit state, 2 groups of 24 bits) and is unchanged.

The 48 global barrier status bits also feed into an interrupt control block. Each of the 48 bits can be separately enabled or masked off for generating interrupts to the processors. When one bit in a 3 bit class is configured as a global interrupt, the corresponding global barrier control bit is first initialized to zero on all nodes, then the interrupt control block is programmed to enable an interrupt when that particular global barrier status bit goes high (‘1’). After this initial setup, any one of the nodes within the class could raise the bit by writing a ‘1’ into its global barrier control register at the specific bit position. Because the global barrier logic functions as a global OR of the control signal on all nodes, the ‘1’ will be propagated to all nodes in the same class, and trigger a global interrupt on all nodes. Optionally, one can also mask off the global interrupt and have a processor poll the global interrupt status instead.
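The interaction between the global OR and the interrupt enable mask can be sketched as follows. This is a minimal model: the function name `pending_interrupts` is hypothetical, and the control registers and enable mask are modelled as plain integers rather than memory mapped registers.

```python
def pending_interrupts(controls, enable_mask):
    """Return the status bits that would raise an interrupt.

    controls is a list with one control register value per node in the class;
    enable_mask has a 1 for every status bit whose interrupt is enabled.
    """
    status = 0
    for c in controls:
        status |= c  # global OR of the control bits on all nodes
    return status & enable_mask
```

If a status bit is masked off, the result for that bit stays 0 even when a node raises it, which corresponds to the polling alternative where a processor reads the status directly.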

On BlueGene/Q, while the global barrier and global interrupt network is implemented as a global OR of all global barrier state bits from all nodes (logic 1220 and 1240), it provides both global AND and global OR operations. Global AND is achieved by utilizing a ‘1’ to ‘0’ transition on a specific global barrier state bit, and global OR is achieved by utilizing a ‘0’ to ‘1’ transition. In practice, one can also implement the logic blocks 1220 and 1240 as AND reduces, in which case global AND is achieved with a ‘0’ to ‘1’ state transition and global OR with a ‘1’ to ‘0’ transition. Any logically equivalent implementations to achieve the same global AND and global OR operations should be covered by this invention.

While it is apparent that the invention herein disclosed is well calculated to fulfill the objects discussed above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

What is claimed is:
 1. A method of embedding a collective network in a parallel computer system organized as a torus network, said computer system including a multitude of nodes, each of the nodes having a plurality of receivers and a plurality of senders, the method comprising: adding to the torus network a central collective logic to route messages among at least a group of said nodes in a tree structure, wherein, at defined times, one of said group of nodes is a root node and the others of said group of nodes are leaf or intermediate nodes, and including routing messages from the leaf or intermediate nodes to the root node in an up tree direction; processing the messages being routed from the leaf or intermediate nodes to the root node to form a processed message; and sending the processed message back from the root node to at least one of the leaf, intermediate or root nodes.
 2. The method according to claim 1, wherein said processing includes combining the messages from the leaf or intermediate nodes into one, combined message.
 3. The method according to claim 1, wherein the sending includes sending the processed message from the root node to all the leaf, intermediate and root nodes in a down tree direction.
 4. The method according to claim 1, wherein said adding further includes: adding dedicated injection fifos and dedicated reception fifos for collective operations; and adding dedicated collective virtual channels to each of the receivers, and adding a central collective logic block to handle arbitration and collective data flow.
 5. The method according to claim 1, wherein one can double the number of classes when the size of a torus dimension is two.
 6. A system for embedding a collective network in a parallel computer system organized as a torus network, said computer system including a multitude of nodes, each of the nodes having a plurality of receivers and a plurality of senders, the system comprising one or more processing units configured for: providing a central collective logic to the torus network to route messages among at least a group of said nodes in a tree structure, wherein, at defined times, one of said group of nodes is a root node and the others of said group of nodes are leaf or intermediate nodes, and including routing messages from the leaf or intermediate nodes to the root node in an up tree direction; processing the messages being routed from the leaf or intermediate nodes to the root node to form a processed message; and sending the processed message back from the root node to at least one of the leaf, intermediate or root nodes.
 7. The system according to claim 6, wherein said processing includes combining the messages from the leaf, intermediate and root nodes into one, combined message.
 8. The system according to claim 6, wherein the sending includes sending the processed message from the root node to all the leaf, intermediate and root nodes in a down tree direction.
 9. The system according to claim 8, wherein said adding further includes adding dedicated injection fifos and dedicated reception fifos for collective operations.
 10. The system according to claim 9, wherein said adding further includes adding dedicated collective virtual channels to each of the receivers, and adding a central collective logic block to handle arbitration and collective data flow.